Succinct data structures for flexible text retrieval systems
Abstract
We propose succinct data structures for text retrieval systems that support document listing queries and ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents. Traditional data structures for these problems support queries only for some predetermined keywords. Recently, Muthukrishnan proposed a data structure that answers document listing queries for arbitrary patterns, at the cost of a larger data structure. For computing tf*idf scores, there have been no efficient data structures for arbitrary patterns. Our new data structures support these queries using small space. The space is only 2/ε times the size of the compressed documents plus 10n bits for a document collection of length n, for any 0 < ε ≤ 1. This is much smaller than the previous O(n log n)-bit data structures. Query time is O(m + q log^ε n) for listing, and computing tf*idf scores for, all q documents containing a given pattern of length m. Our data structures are flexible in the sense that they support queries for arbitrary patterns.
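To make the two query types concrete, here is a minimal sketch of document listing with tf*idf scoring for an arbitrary pattern. It is not the succinct structure proposed in the paper: it uses a plain, uncompressed suffix array with a naive construction, so it meets neither the space bound nor the query-time bound stated above. The function names `build_index` and `query`, the separator character, and the scoring formula tf * log(N / df) (one common tf*idf variant) are all assumptions made for illustration.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter

SEP = "\x00"  # assumed separator; must not occur in any document or pattern


def build_index(docs):
    """Concatenate the documents and build a plain (non-succinct) suffix array."""
    text = SEP.join(docs) + SEP
    # doc_of[i] = number of the document that owns text position i
    doc_of = []
    for d, doc in enumerate(docs):
        doc_of.extend([d] * (len(doc) + 1))  # +1 covers the trailing separator
    # Naive construction by sorting suffixes; acceptable for a toy example only
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, doc_of


def query(index, num_docs, pattern):
    """Return {document number: tf*idf score} for documents containing pattern."""
    text, sa, doc_of = index
    m = len(pattern)
    # Suffixes that start with the pattern occupy one contiguous range of the
    # suffix array, so two binary searches over length-m prefixes locate it.
    prefixes = [text[i:i + m] for i in sa]  # materialised for simplicity
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Term frequency: occurrences of the pattern inside each document
    tf = Counter(doc_of[sa[k]] for k in range(lo, hi))
    if not tf:
        return {}
    # Inverse document frequency (assumed variant): log(N / df)
    idf = math.log(num_docs / len(tf))
    return {d: freq * idf for d, freq in tf.items()}
```

For example, with the documents `["abracadabra", "cadabra", "xyz"]`, calling `query(build_index(docs), 3, "abra")` lists documents 0 and 1, with document 0 scoring twice as high because the pattern occurs twice in it. The pattern need not be a predetermined keyword; any substring can be queried, which is the flexibility the abstract refers to.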
Related resources
Succinct Data Structures for NLP-at-Scale
Succinct data structures involve the use of novel data structures, compression technologies, and other mechanisms to allow data to be stored in extremely small memory or disk footprints, while still allowing for efficient access to the underlying data. They have successfully been applied in areas such as Information Retrieval and Bioinformatics to create highly compressible in-memory search ind...
Cell probe lower bounds for succinct data structures
In this paper, we consider several static data structure problems in the deterministic cell probe model. We develop a new technique for proving lower bounds for succinct data structures, where the redundancy in the storage can be small compared to the information-theoretic minimum. In fact, we succeed in matching (up to constant factors) the lower order terms of the existing data structures with...
Engineering a Distributed Full-Text Index
We present a distributed full-text index for big data applications in a distributed environment. The index can be used to answer different types of pattern matching queries (existential, counting and enumeration) and also be extended to answer document retrieval queries (counting, retrieve and top-k). We also show that succinct data structures are indeed useful for big data applications, as the...
A Framework for Dynamizing Succinct Data Structures
We present a framework to dynamize succinct data structures, to encourage their use over non-succinct versions in a wide variety of important application areas. Our framework can dynamize most state-of-the-art succinct data structures for dictionaries, ordinal trees, labeled trees, and text collections. Of particular note is its direct application to XML indexing structures that answer subpath q...
New algorithms on wavelet trees and applications to information retrieval
Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile que...
Journal: J. Discrete Algorithms
Volume: 5
Publication date: 2007